Search CORE

13 research outputs found

Partitioning clustering algorithms for protein sequence data sets

Author: A Enright
A Enright
A Herger
A Krause
DW Mount
E Bolten
E Kriventseva
F Can
G Yona
H Cathy
H Spath
J Hartigan
J Shi
KJ Anil
L Kaufman
Mohamed Limam
N Essoussi
Nadia Essoussi
O Sasson
P Cabena
P Clote
P Pipenbacher
P Sperisen
R Ng
R Tatusov
RC Dubes
S Altschul
S Henikoff
S Schneckener
S Van Dongen
SB Needleman
SE Brenner
Sondes Fayech
TF Smith
UM Fayyad
V Faber
V Guralnik
WR Pearson
Z Wu
Publication venue: BioMed Central
Publication date: 01/04/2009
Field of study

Abstract Background Genome-sequencing projects are currently producing an enormous amount of new sequences and cause the rapid increasing of protein sequence databases. The unsupervised classification of these data into functional groups or families, clustering, has become one of the principal research objectives in structural and functional genomics. Computer programs to automatically and accurately classify sequences into families become a necessity. A significant number of methods have addressed the clustering of protein sequences and most of them can be categorized in three major groups: hierarchical, graph-based and partitioning methods. Among the various sequence clustering methods in literature, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are extremely used in other fields, few applications have been found in the field of protein sequence clustering. It is not fully demonstrated if partitioning methods can be applied to protein sequence data and if these methods can be efficient compared to the published clustering methods. Methods We developed four partitioning clustering approaches using Smith-Waterman local-alignment algorithm to determine pair-wise similarities of sequences. Four different sets of protein sequences were used as evaluation data sets for the proposed methods. Results We show that these methods outperform several other published clustering methods in terms of correctly predicting a classifier and especially in terms of the correctness of the provided prediction. The software is available to academic users from the authors upon request.</p

Crossref

Directory of Open Access Journals

PubMed Central

Learning probabilistic relational models with (partially structured) graph databases

Author: El Abri Marwa
Essoussi Nadia
Leray Philippe
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

International audienceProbabilistic Relational Models (PRMs) such as Directed Acyclic Probabilistic Entity Relationship (DAPER) models are probabilistic models dealing with knowledge representation and relational data. Existing literature dealing with PRM and DAPER relies on well structured relational databases. In contrast, a large portion of real-world data is stored in Nosql databases specially graph databases that do not depend on a rigid schema. This paper builds on the recent work on DAPER models, and describes how to learn them from partially structured graph databases. Our contribution is twofold. First, we present how to extract the underlying ER model from a partially structured graph database. Then, we describe a method to compute sufficient statistics based on graph traversal techniques. Our objective is also twofold: we want to learn DAPERs with less structured data, and we want to accelerate the learning process by querying graph databases. Our experiments show that both objectives are completed, transforming the structure learning process into a more feasible task even when data are less structured than an usual relational database

Crossref

Hal-Diderot

Generalization of c-means for identifying non-disjoint clusters with overlap regulation

Author: Ben N'Cir Chiheb-Eddine
Cleuziou Guillaume
Essoussi Nadia
Publication venue: 'Elsevier BV'
Publication date: 13/04/2014
Field of study

International audienceClustering is an unsupervised learning method that enables to fit structures in unlabeled data sets. Detecting overlapping structures is a specific challenge involving its own theoretical issues but offering relevant solutions for many application domains. This paper presents generalizations of the c-means algorithm allowing the parametrization of the overlap sizes. Two regulation principles are introduced, that aim to control the overlap shapes and sizes as regard to the number and the dispersal of the cluster concerned. The experiments performed on real world datasets show the efficiency of the proposed principles and especially the ability of the second one to build reliable overlaps with an easy tuning and whatever the requirement on the number of clusters

HAL - Normandie Université